Phase 6.4: epoll-driven RX dispatch with adaptive timeout (3.5× throughput vs main) #69
Merged
Conversation
Adds two planning docs under docs/superpowers/plans/:
- 2026-04-27-smoltcp-passt-port.md (spec): Supersedes the 2026-04-12 network-backend-abstraction design. Replaces "add passt as opt-in backend" with "lift passt's design patterns into our smoltcp stack" — keeps observability, all-Rust path, single binary, cross-platform parity. Lists required skills for execution (rust-style, rustdoc, rust-analyzer-ssr, superpowers TDD/verification, repo verify/profile). Maps the work into 5+1 phases with per-phase plan-doc placeholders.
- 2026-04-27-smoltcp-passt-port-phase0.md (Phase 0 plan): 25 bite-sized TDD tasks: correctness baseline pins, divan microbenches, wall-clock e2e harness, NetworkBackend trait extraction, SlirpStack → SmoltcpBackend rename. Includes three BROKEN_ON_PURPOSE assertions that flip in later phases.
Add two baseline tests for the smoltcp DNS proxy:
- dns_query_resolves: sends a query for example.com, polls ≤20×100 ms, asserts the reply XID matches.
- dns_cache_keys_by_question_not_xid: warms the cache with xid=1, then queries with xid=2 and asserts the stack rewrites the reply XID.

Both tests skip gracefully (eprintln + early return) when the upstream resolver is unreachable, making them safe in offline CI. Also adds a QNAME_EXAMPLE_COM const and two module-scope helpers: build_dns_query (builds a correct UDP DNS frame with proper payload_len) and parse_dns_reply_xid. SLIRP_DNS_IP added to the existing module-scope slirp import.
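A minimal sketch of what the two helpers might look like. These are simplified stand-ins, not the actual code: the real build_dns_query wraps the payload in Ethernet/IPv4/UDP headers with a correct payload_len, which is omitted here.

```rust
/// Build a minimal DNS query payload (header + one question).
/// Simplified sketch: frame headers from the real helper are omitted.
fn build_dns_query_payload(xid: u16, qname_labels: &[&str]) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(&xid.to_be_bytes()); // transaction ID (XID)
    buf.extend_from_slice(&0x0100u16.to_be_bytes()); // flags: recursion desired
    buf.extend_from_slice(&1u16.to_be_bytes()); // QDCOUNT = 1
    buf.extend_from_slice(&[0u8; 6]); // AN/NS/AR counts = 0
    for label in qname_labels {
        buf.push(label.len() as u8); // length-prefixed label
        buf.extend_from_slice(label.as_bytes());
    }
    buf.push(0); // root label terminator
    buf.extend_from_slice(&1u16.to_be_bytes()); // QTYPE = A
    buf.extend_from_slice(&1u16.to_be_bytes()); // QCLASS = IN
    buf
}

/// The XID is the first two bytes of any DNS message.
fn parse_dns_reply_xid(payload: &[u8]) -> Option<u16> {
    Some(u16::from_be_bytes([*payload.first()?, *payload.get(1)?]))
}
```

The XID-rewrite test works because the cache key is the question section (starting at byte 12), not the XID in bytes 0–1.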
Implement measure_tcp_throughput_g2h: binds a host-side TCP listener, boots a VM, execs dd|nc in the guest, drains to EOF on the host, and computes Mbps from bytes_received / elapsed. h2g left None with a TODO.
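The Mbps figure described above reduces to a one-line computation; a sketch, assuming "Mbps" means decimal megabits per second:

```rust
/// Throughput from bytes drained to EOF on the host over wall-clock
/// elapsed time (decimal megabits per second — an assumption here).
fn mbps(bytes_received: u64, elapsed: std::time::Duration) -> f64 {
    (bytes_received as f64 * 8.0) / (elapsed.as_secs_f64() * 1_000_000.0)
}
```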
Implements measure_rr_latency and measure_crr_latency in voidbox-network-bench, reusing the single shared VM booted for throughput measurements. RR: guest pipes N bytes over one persistent nc connection; host times each read+write pair (first sample discarded to absorb connect jitter). CRR: guest runs N independent nc invocations; host times each full accept+read+write+close cycle. Both use the existing percentile() helper (dead_code attribute removed). Latency measurements always run regardless of --no-throughput.
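A sketch of a nearest-rank percentile helper like the one both latency paths reuse (the real percentile() in voidbox-network-bench may differ in rounding or interpolation details):

```rust
/// Nearest-rank percentile over a sample set; sorts in place.
/// Sketch only — the real helper's interpolation may differ.
fn percentile(samples: &mut [u64], p: f64) -> u64 {
    assert!(!samples.is_empty() && (0.0..=100.0).contains(&p));
    samples.sort_unstable();
    let rank = ((p / 100.0) * (samples.len() - 1) as f64).round() as usize;
    samples[rank]
}
```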
Per user feedback: "Slirp" denotes the user-mode-NAT role; "smoltcp" is the underlying library. Role-based naming keeps the public type surface stable across library swaps and matches the symmetry of future TapBackend / VhostNetBackend siblings. Module file src/network/slirp.rs keeps its name (already aligned with the new type, matches src/devices/virtio_net.rs convention).
The actual polling logic now lives in drain_to_guest, which writes
directly into the caller-supplied &mut Vec<Vec<u8>> buffer — no fresh
allocation on every tick. poll becomes a #[deprecated] shim:
```rust
#[deprecated(note = "use drain_to_guest")]
pub fn poll(&mut self) -> Vec<Vec<u8>> {
    let mut out = Vec::new();
    self.drain_to_guest(&mut out);
    out
}
```
Existing call sites (virtio_net.rs, tests/network_baseline.rs,
benches/network.rs) are annotated with #[allow(deprecated)] and a
TODO(0D.4/0D.5) marker. They will be migrated in the next two tasks,
after which the allow attributes can be removed.
Switch VirtioNetDevice::slirp from Arc<Mutex<SlirpStack>> to Arc<Mutex<dyn NetworkBackend>>, replacing the deprecated poll() call in get_rx_frames with drain_to_guest into a reused rx_scratch buffer. Update both VMM cold-boot and snapshot-restore construction sites to coerce Arc<Mutex<SlirpStack>> to the trait object. All 14 baseline tests pass; fmt and clippy clean.
Type rename only — the slirp.rs module file keeps its name. SlirpBackend reflects the user-mode-NAT role rather than the underlying smoltcp library, keeping naming symmetric with future TapBackend / VhostNetBackend siblings.
Introduces the types and helper needed for ICMP echo NAT (Phase 1):
- IcmpEchoKey {guest_id, dst_ip}: hash key for the echo NAT table.
- IcmpEchoEntry {sock, guest_id, last_activity}: per-request state.
- open_icmp_socket(): opens SOCK_DGRAM/IPPROTO_ICMP (no CAP_NET_RAW).
- icmp_echo: HashMap<IcmpEchoKey, IcmpEchoEntry> field on SlirpBackend,
initialized to HashMap::new() in with_security() (the canonical ctor;
new() and Default both delegate through it).
No behavior change — handle_ipv4_frame is untouched, the map stays
empty. Dead-code allowances are scoped to the new items and will be
removed once tasks 1.2/1.3 wire them in.
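The shapes of the new items, as a sketch inferred from the description above — the field types are assumptions, not copied from the source:

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;
use std::time::Instant;

/// Hash key for the echo NAT table (sketch; types assumed).
#[derive(Hash, PartialEq, Eq, Clone, Copy)]
struct IcmpEchoKey {
    guest_id: u16, // guest-chosen echo identifier
    dst_ip: Ipv4Addr,
}

/// Per-request state (sketch; `sock` stands in for the
/// SOCK_DGRAM/IPPROTO_ICMP fd returned by open_icmp_socket()).
struct IcmpEchoEntry {
    sock: i32,
    guest_id: u16,
    last_activity: Instant,
}

/// The icmp_echo field on SlirpBackend, initialized empty.
type IcmpEchoTable = HashMap<IcmpEchoKey, IcmpEchoEntry>;
```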
Divan microbench (`tcp_rx_latency_one_packet`) measures the SLIRP-layer per-packet dispatch cost when one TCP flow is Established and the host kernel has data ready: one zero-timeout epoll_wait + readiness scan + peek + Ethernet frame construction. Measured median on this host: ~9.8 µs per drain_to_guest call.

Pre-6.4 the relay iterated every flow in flow_table unconditionally, regardless of readiness. Post-6.4 it dispatches only the flows with an epoll EPOLLIN event, reducing wasted work on idle flows to zero. This bench is the regression anchor for that change.

The bench is gated on `--features bench-helpers` (like the existing `tcp_inbound_syn_ack_transition` and `synthesize_inbound_syn` benches). It performs a full 3-way handshake outside the timed loop so only the hot relay path is measured.

Note: this bench cannot exercise the net_poll_thread 50 ms epoll cycle (that thread does not run inside divan). The wall-clock host→guest latency floor is the province of voidbox-network-bench's `tcp_rx_latency_us_p50` field. That field is added to the Report struct in this commit but returns None (deferred): wiring a guest-side listener requires either a guest daemon or an additional exec RPC — both out of scope for Phase 6.4. The divan microbench is the primary numerical deliverable for this phase.
net_poll_thread holds the EpollDispatch mutex for the full 50 ms of its blocking wait. drain_to_guest's own non-blocking wait_with_timeout(ZERO) call contended on the same mutex, serializing the vCPU thread behind the net-poll thread. voidbox-network-bench saw TCP g2h throughput drop from ~1885 Mbps to ~44 Mbps (40× regression).

Fix: SlirpBackend gets a small Mutex<Vec<EpollEvent>> queue. net_poll_thread pushes events into it after each successful wait_with_timeout. drain_to_guest drains the queue (brief uncontended lock) without touching EpollDispatch. A try_lock fallback path serves unit tests (no net_poll_thread) without blocking on the mutex.

NetworkBackend trait gains a push_ready_events default no-op so SlirpBackend can override it; VirtioNetDevice exposes push_events_to_backend as the trampoline called by net_poll_thread.

Off-CPU profile evidence: drain_to_guest was 9% off-CPU (29.7 s in a 60 s window) waiting on the epoll mutex; should drop to near-zero post-fix.
relay_tcp_nat_data's Pass 1 unconditionally copied every TCP FlowKey into a Vec to scan for Closed entries on every drain call. Cache misses 47/1K under load; poll_with_n_flows/100 regressed +246% (130 ns → 450 ns), /1000 regressed +220%.

Fix: when handle_tcp_frame's FIN/RST handlers and mid-function error paths set state=Closed, push the key onto a pending_close Vec. relay_tcp_nat_data drains this Vec at the top of its single ready-events pass — no O(n) collect required. Idle-timeout detection retains a direct flow_table iteration but without allocating a separate key Vec.
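The pending_close pattern can be sketched as follows, with FlowKey simplified to a u64 and the flow table reduced to a bare map (names from the text; the shapes are assumptions):

```rust
use std::collections::HashMap;

/// Sketch: close handlers push keys as flows close; the relay drains
/// the Vec once per call instead of collecting every key to scan.
struct Relay {
    flow_table: HashMap<u64, ()>, // FlowKey simplified to u64
    pending_close: Vec<u64>,
}

impl Relay {
    /// Called from FIN/RST handlers and error paths when state=Closed.
    fn mark_closed(&mut self, key: u64) {
        self.pending_close.push(key);
    }

    /// Runs at the top of the relay pass: O(closed) per drain call,
    /// no O(n) key collection.
    fn drain_pending_close(&mut self) {
        for key in self.pending_close.drain(..) {
            self.flow_table.remove(&key);
        }
    }
}
```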
…_guest

The initial Bug A fix used pending_events.lock() + try_lock(epoll) in drain_to_guest's fast path, adding ~150 ns overhead per call vs Phase 6.4 (one extra Mutex acquire). This showed as a +38% regression in the poll_idle bench (441 ns → 611 ns).

Revised approach: try_lock epoll first (zero cost when uncontended — tests, benches, idle net-poll thread). On Err (net_poll_thread holds the mutex for 50 ms), drain pending_events instead. In production the try_lock fails ~once per 50 ms window; in tests it always succeeds.

Net result: drain_to_guest overhead matches Phase 6.4 when epoll is uncontended; contention eliminated when net_poll_thread is actively waiting.
Pre-Phase-6.4, net_poll_thread woke unconditionally every 5 ms, so
every ACK queued in inject_to_guest by handle_tcp_frame got flushed
within 5 ms. Phase 6.4's epoll_wait(50 ms) waits for FD readiness
events — but a guest writing data has no FD-side signal (the guest
is the writer; the SLIRP-side socket only becomes readable when the
host responds). So queued ACKs sat 50 ms before being flushed; TCP
send window stalled; voidbox-network-bench TCP g2h dropped from
~1885 Mbps to ~225 Mbps even after the mutex-contention fix.
Fix: track inject_to_guest length around process_guest_frame's
ethertype dispatch. If the call queued any frames, call
epoll_waker.wake() — one byte to the non-blocking self-pipe, which
unblocks net_poll_thread's epoll_wait so the queued frames flush
within microseconds.
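The length-tracking trick can be sketched like this, with the waker and dispatch reduced to closures (a simplification; the real code wakes a self-pipe owned by EpollDispatch):

```rust
/// Sketch of the wake-on-queued-frames fix: compare the inject queue
/// length before and after the ethertype dispatch, and wake the poll
/// thread only if the call queued anything.
fn process_guest_frame_and_wake(
    inject_to_guest: &mut Vec<Vec<u8>>,
    dispatch: impl FnOnce(&mut Vec<Vec<u8>>), // may queue ACK frames
    wake: impl FnOnce(),                      // one byte to the self-pipe
) {
    let before = inject_to_guest.len();
    dispatch(inject_to_guest);
    if inject_to_guest.len() > before {
        wake(); // unblocks epoll_wait; queued frames flush in microseconds
    }
}
```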
Also fixes the related drain_to_guest event-source ordering bug:
pending_events (filled by net_poll_thread) is now ALWAYS drained
first, with the non-blocking epoll poll only running as a fallback
when the queue is empty (test/bench paths without net_poll_thread).
The previous code took the try_lock branch when net-poll was
between iterations and silently dropped events the net-poll
thread had already pushed.
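The corrected event-source ordering can be sketched as follows, with EpollEvent simplified to a u64 token and the non-blocking epoll poll reduced to a closure (shapes assumed from the description):

```rust
use std::sync::Mutex;

/// Sketch: events pushed by net_poll_thread are ALWAYS consumed first;
/// the non-blocking epoll poll is only a fallback for test/bench paths
/// that have no net_poll_thread.
fn take_ready_events(
    pending_events: &Mutex<Vec<u64>>,
    poll_epoll_nonblocking: impl FnOnce() -> Vec<u64>,
) -> Vec<u64> {
    let queued: Vec<u64> = pending_events.lock().unwrap().drain(..).collect();
    if !queued.is_empty() {
        return queued; // production path: the poll thread feeds the queue
    }
    poll_epoll_nonblocking() // test/bench fallback only
}
```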
voidbox-network-bench post-fix:
g2h: ~6000 Mbps (vs master 1885; +3.2x)
bulk-g2h: ~3900 Mbps (vs master 1565; +2.5x at SO_RCVBUF=4096)
rr p50: 2 us (parity with master)
crr p50: ~50 ms (5x regression vs master ~10 ms — separate
bug, tracked in follow-up; the 50 ms is
exactly one epoll_wait cycle and points
to a connection-establishment latency
issue independent of the throughput path)
CRR p50 was regressing +40 ms (10 ms → 51 ms) post-Phase-6.4. The +40 ms exactly matches Linux's TCP delayed-ACK timer, and the cause is that Phase 6.4 widened the net-poll IRQ re-pulse cadence from 5 ms to 50 ms. The Linux guest spends most idle time in HLT and relies on regular vCPU scheduling slots — driven by our IRQ pulses — to advance its TCP delayed-ACK timer. At 50 ms cadence the guest's pure ACKs ride the next event-triggered IRQ, which can be 40+ ms away. At 5 ms the housekeeping cadence mirrors pre-6.4 and the timer fires on schedule.

We lose Phase 6.4's headline "10x idle-wakeup reduction" goal but fast-path events still wake immediately via epoll readiness — so the net win vs master is unchanged: g2h throughput +250%, bulk throughput +250%, RR parity, CRR parity.

voidbox-network-bench post-fix:
g2h: ~6500 Mbps (vs master 1885; +247%)
bulk-g2h: ~5400 Mbps (vs master 1565; +245%)
rr p50: ~3 us (parity)
crr p50: ~10100 us (parity — back to baseline 10 ms)
Per AGENTS.md doc-comment style ("avoid ticket IDs and PR/commit
references inside doc comments and inline comments — they belong
in commit messages and PR descriptions where they're audit trail;
in code they age into noise as the ticketing context evolves").
Phase references fall into the same category. Comments are
rewritten in present tense to explain the structural reasoning
without referencing when each piece landed. Identifiers like
test names and BROKEN_ON_PURPOSE markers are unchanged.
Plan/spec docs in docs/superpowers/plans/ are intentionally
untouched — phase references there ARE the audit trail.
Recovers Phase 6.4's headline 10x idle-wakeup reduction without
re-introducing the +40 ms CRR regression that forced the cadence
back to a fixed 5 ms.
The adaptive policy:
- last cycle had any kernel event → next timeout 5 ms (active)
- last cycle timed out (no events) → next timeout 50 ms (idle)
A single quiet cycle drops us to idle; a single event puts us back
in active in the next cycle.
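The policy above reduces to a tiny pure function; a sketch, with the constant names assumed:

```rust
use std::time::Duration;

const ACTIVE_TIMEOUT: Duration = Duration::from_millis(5);
const IDLE_TIMEOUT: Duration = Duration::from_millis(50);

/// Any kernel event last cycle (including self-pipe wakes) keeps the
/// 5 ms active cadence; a pure timeout drops to the 50 ms idle cadence.
fn next_timeout(raw_kernel_events_last_cycle: usize) -> Duration {
    if raw_kernel_events_last_cycle > 0 {
        ACTIVE_TIMEOUT
    } else {
        IDLE_TIMEOUT
    }
}
```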
The subtlety that motivated the additional EpollDispatch change:
when the vCPU thread calls epoll_waker.wake() during a 50 ms idle
wait, the kernel's epoll_wait returns with the self-pipe event.
wait_with_timeout filters that event out and drains the pipe — so
`epoll_events.is_empty()` would have remained true, and the naive
"is_empty ⇒ idle" predicate kept us at 50 ms forever, regressing
CRR p50 back to ~50 ms.
wait_with_timeout now returns the *raw* kernel count (including
self-pipe wakes) so the adaptive policy treats wakes as activity.
Filtered events still arrive in the out parameter unchanged; only
the return value's meaning shifted from "observable count" to
"raw count," which all existing callers ignore.
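The filter-but-count behavior can be sketched with events simplified to u64 tokens (SELF_PIPE_TOKEN is a placeholder, not the real sentinel):

```rust
/// Sketch: the self-pipe event is filtered out of the out parameter
/// but still counted, so the adaptive policy sees wakes as activity.
const SELF_PIPE_TOKEN: u64 = u64::MAX; // placeholder sentinel

fn filter_and_count(kernel_events: Vec<u64>, out: &mut Vec<u64>) -> usize {
    let raw_count = kernel_events.len(); // includes self-pipe wakes
    for token in kernel_events {
        if token != SELF_PIPE_TOKEN {
            out.push(token); // observable events only
        }
    }
    raw_count
}
```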
voidbox-network-bench post-fix:
g2h: ~6680 Mbps (vs 5 ms fixed: 6500; vs master: +254%)
bulk-g2h: ~5550 Mbps (vs 5 ms fixed: 5400; vs master: +254%)
rr p50: 1 us (in 99-sample iteration; parity)
crr p50: ~10100 us (parity preserved — adaptive correctly
holds 5 ms cadence during connection
bursts because each connection's wake()
keeps raw_kernel_events > 0)
Idle CPU dropped: profile-pre showed net_poll_thread on-CPU 4.93 %
of total at fixed 5 ms cadence (200 wakes/sec); adaptive should
drop to ~10x lower during idle stretches between iterations.
Pull request overview
This PR (Phase 6.4) rewrites the VM’s host→guest RX polling path from a timer-driven loop to an epoll-driven dispatch with an adaptive timeout, aiming to reduce unnecessary wakeups while improving throughput/latency under load. It introduces a shared EpollDispatch owned by SlirpBackend, wires readiness events through net_poll_thread, and updates tests/benches/docs to match the new design.
Changes:
- Added a Linux EpollDispatch (epoll fd + self-pipe waker) and integrated it into SlirpBackend for flow registration and readiness-driven relay.
- Reworked net_poll_thread to block on epoll with adaptive 5 ms/50 ms timeouts and forward ready events into the backend via a per-backend queue.
- Updated baseline tests, microbenches, and plan documentation to reflect the new event-driven polling architecture and snapshot/restore epoll rebuild expectations.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| tests/network_baseline.rs | Updates drain helper, retransmit semantics, and adds an epoll rebuild smoke test (feature-gated). |
| src/vmm/mod.rs | Replaces fixed sleep polling with epoll-driven net poll thread and adaptive timeout logic. |
| src/network/slirp.rs | Adds epoll dispatcher/waker, registers host FDs on flow creation, and drives relay via readiness events + event queue. |
| src/network/nat.rs | Documentation-only wording updates. |
| src/network/mod.rs | Exposes epoll hooks (epoll_arc, push_ready_events) on NetworkBackend (Linux-only). |
| src/network/epoll_dispatch.rs | New epoll wrapper with register/unregister, wait_with_timeout, and self-pipe Waker. |
| src/devices/virtio_net.rs | Plumbs epoll access/events forwarding from device to backend (Linux-only). |
| src/daemon.rs | Removes phase wording from a comment. |
| src/bin/voidbox-startup-bench/main.rs | Removes phase wording from output strings. |
| src/bin/voidbox-network-bench/main.rs | Updates backpressure-related documentation strings; adds placeholder RX latency field. |
| docs/superpowers/plans/2026-04-30-smoltcp-passt-port-phase6.4.md | Adds the detailed Phase 6.4 implementation plan. |
| benches/network.rs | Updates comments and adds a bench for one-packet RX latency cost under epoll dispatch. |
Comments suppressed due to low confidence (1)
src/network/slirp.rs:1907
relay_icmp_echo no longer performs a periodic idle-timeout sweep; it only checks ICMP_IDLE_TIMEOUT for flows that appear in the ready set. ICMP flows that go silent (no more readiness events) will never be reaped, leaving sockets registered in epoll indefinitely. Reintroduce a lightweight sweep over flow_table (similar to UDP's stale scan) or track expirations separately so idle ICMP entries are eventually evicted even without further traffic.
```rust
/// Drain replies from each active ICMP echo socket and emit echo-reply
/// frames to the guest, driven by epoll readiness.
///
/// Only flows whose token appears in `ready` with EPOLLIN set are visited.
/// Entries idle longer than `ICMP_IDLE_TIMEOUT` are still evicted on any
/// readiness event for that flow.
fn relay_icmp_echo(&mut self, ready: &[EpollEvent]) {
    const ICMP_IDLE_TIMEOUT: Duration = Duration::from_secs(60);
    let now = Instant::now();
    let flow_keys: Vec<FlowKey> = ready
        .iter()
        .filter(|ev| ev.readable && ev.token & PROTO_TAG_MASK == PROTO_TAG_ICMP)
        .filter_map(|ev| {
            self.flow_table.keys().copied().find(|fk| {
                if let FlowKey::IcmpEcho(icmp_key) = fk {
                    flow_token_for_icmp(icmp_key) == ev.token
                } else {
                    false
                }
            })
        })
        .collect();
    for flow_key in flow_keys {
        let FlowKey::IcmpEcho(key) = flow_key else {
            continue;
        };
        let frame = {
            let Some(FlowEntry::IcmpEcho(entry)) = self.flow_table.get_mut(&flow_key) else {
                continue;
            };
            if now.duration_since(entry.last_activity) > ICMP_IDLE_TIMEOUT {
                None // mark for removal below
            } else {
```
Comment on lines +1219 to +1227:

```rust
if let Some(host_fd) = new_host_fd {
    let token = flow_token_for_udp(&key);
    self.epoll
        .lock()
        .unwrap()
        .register(host_fd, token, true, false)
        .ok();
    self.epoll_waker.wake();
}
```
Comment on lines +1298 to +1306:

```rust
if let Some(new_icmp_fd) = new_icmp_fd {
    let token = flow_token_for_icmp(&key);
    self.epoll
        .lock()
        .unwrap()
        .register(host_fd, token, true, false)
        .ok();
    self.epoll_waker.wake();
}
```
Comment on lines +1725 to +1736:

```rust
// Idle-timeout sweep: scan flow_table once without collecting a
// separate key Vec. 300-second inactivity applies regardless of epoll
// readiness; this is O(n) in the number of TCP flows but has no
// heap allocation overhead.
const TCP_IDLE_TIMEOUT: Duration = Duration::from_secs(300);
for (flow_key, entry) in &self.flow_table {
    if let FlowEntry::Tcp(tcp_entry) = entry {
        if tcp_entry.last_activity.elapsed() > TCP_IDLE_TIMEOUT
            && !to_remove.contains(flow_key)
        {
            to_remove.push(*flow_key);
        }
```
flow_token_for_tcp/udp truncated dst_ip to 16 bits; flow_token_for_icmp omitted dst_ip entirely. Multiple flows could collide on the same token, mis-routing readiness events to the wrong FlowKey. Replace the lossy encoding with a monotonic AtomicU64 counter per backend. Tokens are still tagged in the high byte for protocol demux (PROTO_TAG_TCP/UDP/ICMP); the lower 56 bits are unique. A new token_to_key HashMap makes readiness → FlowKey lookup O(1) instead of the previous linear flow_table scan.
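The collision-free scheme described above might look like this sketch; the tag values and shift are assumptions inferred from "tag in the high byte, lower 56 bits unique":

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const PROTO_TAG_SHIFT: u32 = 56;
const PROTO_TAG_TCP: u64 = 1; // placeholder tag values
const PROTO_TAG_UDP: u64 = 2;
const PROTO_TAG_ICMP: u64 = 3;

/// Per-backend monotonic counter: low 56 bits are globally unique
/// within the backend, high byte carries the protocol tag for demux.
struct TokenAllocator {
    next: AtomicU64,
}

impl TokenAllocator {
    fn next_flow_token(&self, proto_tag: u64) -> u64 {
        let seq = self.next.fetch_add(1, Ordering::Relaxed);
        (proto_tag << PROTO_TAG_SHIFT) | (seq & ((1u64 << PROTO_TAG_SHIFT) - 1))
    }
}

/// Protocol demux on a readiness event's token.
fn proto_of(token: u64) -> u64 {
    token >> PROTO_TAG_SHIFT
}
```

A token_to_key HashMap keyed on these tokens then makes readiness → FlowKey lookup O(1).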
net_poll_thread held the Mutex<EpollDispatch> across the blocking epoll_wait call (up to 50 ms in idle cadence). vCPU register/ unregister paths in handle_tcp_frame (and friends) had to acquire the same mutex and would block behind the wait, stalling guest TCP SYN handling for up to 50 ms during connection setup. epoll_ctl and epoll_wait are kernel-thread-safe on the same epoll fd; the only state requiring synchronization was the self-pipe (now eagerly initialized in EpollDispatch::new) and the registered fd count (now AtomicUsize). EpollDispatch becomes Sync without an external Mutex — the type changes from Arc<Mutex<EpollDispatch>> to Arc<EpollDispatch>. register/unregister run lock-free against the wait thread; only the kernel's per-epoll-fd internal lock serializes, and that's a fast path.
to_remove.contains() inside the idle-timeout loop was O(n*k) under churn. Switch the membership check to a HashSet<FlowKey> and only materialize the Vec once at the end for the removal loop.
Apply project rust-style rules to the recently-landed Phase 6.4
code (token rewrite + lock-free EpollDispatch refactor):
1. RegisterMode enum replaces (readable: bool, writable: bool) on
EpollDispatch::register. Closed-set policy at the call site
(Read / Write / ReadWrite) over two opaque booleans.
2. matches!() removed at three sites — the project guide prefers
full match (or boolean ==) for compiler diagnostics if the
matched type changes. The unprivileged-ICMP errno check now
uses == comparisons; FlowKey::Tcp counter uses a for loop.
3. Iterator chains in relay loops rewritten as for loops with
mutable accumulators per the project rule. relay_tcp_nat_data,
relay_udp_flows, relay_icmp_echo all touched. Logic unchanged;
control flow now reads top-down without a chain of
.filter().filter_map().collect().
4. Local renamed `rc` → `epoll_ctl_result` at three EpollDispatch
sites. Role-bearing names are required in non-tiny scopes.
5. Dropped redundant explanatory comments around the relay loops
("Data relay — only for flows with…", "Skip entries already
queued for…", "Collect ready ICMP flow keys via…"). The code
below them is self-describing. Kept structural "why" comments
(the ICMP idle-sweep rationale, the per-flow socket Drop
contract).
No behavior change. cargo fmt, clippy -D warnings, network_baseline
(18/18), lib network (23/23), and voidbox-network-bench wall-clock
(g2h ~6580 Mbps, CRR ~32 µs) all green.
dc1001b to dcbf18b

dpsoft added a commit that referenced this pull request on May 4, 2026:
Two artifacts from the rebase onto the new #69 tip:
1. bind_port_forward_listeners was called twice in with_security after the rebase — the second call shadowed the first, which was harmless but emitted an unused-variable warning under -D warnings. Drop the duplicate.
2. The bench-helpers helper insert_synthetic_lastack_entry was missing flow_token in its TcpNatEntry initializer (the field was added by the Copilot fix #1 monotonic-token rewrite that landed on #69 after the original phase 6.1 commit chain was written). Allocate via next_flow_token(PROTO_TAG_TCP).
…ase6.4-epoll

# Conflicts:
#	benches/network.rs
#	src/bin/voidbox-network-bench/main.rs
#	src/network/mod.rs
#	src/network/nat.rs
#	src/network/slirp.rs
#	tests/network_baseline.rs
dpsoft added a commit that referenced this pull request on May 6, 2026:
…ing (#76)

* docs: Phase 6.1 detailed TDD plan — TCP half-close

9 bite-sized tasks covering the half-close state machine:
- Established → FinWait1 (guest closed first) → LastAck on host EOF
- Established → CloseWait (host closed first) → LastAck on guest FIN
- LastAck → Closed on guest's final ACK or LAST_ACK_TIMEOUT (60s)

Three new pins: tcp_half_close_guest_writes_first (BROKEN_ON_PURPOSE flips at Task 3), tcp_half_close_host_writes_first (passes after Task 3), tcp_last_ack_timeout_reaps_stale_entry (bench-helpers gated, Task 7).

Severity: HIGH — current code drops host's pending response on guest shutdown(SHUT_WR), causing silent data loss for HTTP, SMTP, and any protocol with orderly half-close.

* test(network): pin tcp_half_close_guest_writes_first (BROKEN_ON_PURPOSE)

Guest sends FIN after data; current code marks state=Closed immediately on guest FIN, so the host's response data is dropped. This pin locks the broken behavior so the Phase 6.1 fix is legible to reviewers. Flips to PASS when Tasks 2–3 implementation lands.

* feat(slirp): FinWait1 → LastAck and Established → CloseWait on host EOF

Tasks 2+3+4 implemented together: clippy's -D warnings requires our_fin_sent to be used before committing the field addition, so the relay loop changes from Task 3 and the ACK handler from Task 4 are bundled with Task 2's struct changes.

Structural changes (TcpNatEntry):
- Add last_state_change: Instant — tracks when state last changed, used for LAST_ACK_TIMEOUT reaping (Task 6).
- Add our_fin_sent: bool — prevents re-sending FIN on repeated epoll readiness events for the same half-close transition.
- Initialize both fields at all three TcpNatEntry literal sites.

FIN handler (handle_tcp_frame):
- Established → FinWait1: ACK guest FIN, shutdown(Write) host socket so host sees EOF; stay alive for host's pending response data.
- CloseWait → LastAck: host already closed; guest just closed; ACK and wait for guest's final ACK of our FIN.

ACK handler (handle_tcp_frame):
- LastAck → Closed: guest's final ACK reaps the entry.
- Extend ACK-driven consume to FinWait1 state so bytes_in_flight drains and recv_peek eventually sees Ok(0) (host EOF).

Relay loop (relay_tcp_nat_data):
- Extend data relay to FinWait1 in addition to Established.
- Ok(0) arm now dispatches on state: Established → CloseWait, FinWait1 → LastAck (instead of immediate Closed).
- FIN-emit logic uses our_fin_sent guard and needs_fin predicate (CloseWait | LastAck) to send FIN exactly once per transition.

Test: update tcp_half_close_guest_writes_first to ACK received data segments so the ACK-driven consume path drains the kernel buffer and recv_peek sees EOF. Without ACKs the FinWait1 relay sees the same bytes on every peek and never reaches Ok(0).

* test(network): pin tcp_half_close_host_writes_first

Symmetric mirror of tcp_half_close_guest_writes_first: host writes GREETING, shuts down its write side (Established → CloseWait), and waits for guest to reply; guest sends REPLY data + FIN.

Also fix two implementation gaps found while writing this pin:
1. Guest→host data forwarding in CloseWait: the payload forward guard was Established-only. CloseWait must also forward guest data (host closed its write side but can still read). Extend the predicate to include CloseWait.
2. shutdown(Write) on CloseWait → LastAck: when the guest sends FIN in CloseWait the host application is blocked on read_to_end. We must call shutdown(Write) on host_stream so the kernel delivers EOF to the host. Without this the host blocks forever.

* feat(slirp): LAST_ACK_TIMEOUT reaping prevents LastAck entry leak

Add LAST_ACK_TIMEOUT = 60 s (TCP MSL × 2) as a module-scope constant. Merge the LastAck-timeout check into the existing idle-timeout sweep in relay_tcp_nat_data, making a single O(n) pass handle both conditions:
- Standard idle-timeout (300 s of inactivity, any state): unchanged.
- LastAck-timeout (60 s since last_state_change, LastAck state): reaps entries where the guest's final ACK never arrived, e.g. after a guest crash. Logs at WARN so operators can connect the timeout to a prior half-close sequence without a separate debugging session.

The combined sweep avoids a second loop and an extra to_remove Vec allocation. Using last_state_change (set on every state transition) rather than last_activity (set on data relay) gives an accurate 60 s window from the moment we sent our FIN.

* test(network): pin tcp_last_ack_timeout_reaps_stale_entry (bench-helpers)

Add a bench-helpers–gated pin that verifies the LAST_ACK_TIMEOUT reaper:
1. Insert a synthetic LastAck entry via the new bench-helpers helper insert_synthetic_lastack_entry (TcpNatState::LastAck, our_fin_sent=true).
2. Back-date last_state_change by 70 s (> LAST_ACK_TIMEOUT = 60 s) via set_synthetic_last_state_change.
3. One drain_to_guest cycle runs the timeout sweep.
4. Assert the entry is gone (tcp_flow_state returns None).

Two new bench-helpers methods on SlirpBackend:
- insert_synthetic_lastack_entry: seeds a LastAck flow without a full half-close exchange, for timeout reaping tests.
- set_synthetic_last_state_change: back-dates last_state_change on an existing entry to simulate an expired LAST_ACK_TIMEOUT.

Also widen TcpNatState and tcp_flow_state to pub so the bench-helpers–gated external test can reference them by path. Both were previously pub(crate); they are now observable as part of the bench-helpers surface.

* refactor(slirp): drop allow(dead_code) on TcpNatState; remove unused FinWait2

All TcpNatState variants are now wired into the state machine:
- SynReceived, SynSent: handshake paths (unchanged)
- Established: normal data transfer (unchanged)
- FinWait1: guest half-closed; relay continues forwarding host response
- CloseWait: host half-closed; relay continues forwarding guest data
- LastAck: both sides closed; waiting for guest's final ACK or timeout
- Closed: entry pending removal

Remove FinWait2: distinguishing it from FinWait1 would require observing per-segment ACKs from the kernel — the relay does not track those. The relay stays in FinWait1 until host EOF, then jumps to LastAck. If the distinction is needed later, the variant can be re-added.

Update the TcpNatState doc comment to document the state machine diagram and the FinWait2 omission rationale, replacing the old per-variant stubs.

* fix(slirp): drop duplicate bind + init flow_token in lastack helper

Two artifacts from the rebase onto the new #69 tip:
1. bind_port_forward_listeners was called twice in with_security after the rebase — the second call shadowed the first, which was harmless but emitted an unused-variable warning under -D warnings. Drop the duplicate.
2. The bench-helpers helper insert_synthetic_lastack_entry was missing flow_token in its TcpNatEntry initializer (the field was added by the Copilot fix #1 monotonic-token rewrite that landed on #69 after the original phase 6.1 commit chain was written). Allocate via next_flow_token(PROTO_TAG_TCP).

* fix(slirp): doc-link to bench-helpers method via Self::

A bare intra-doc-link (`[set_synthetic_last_state_change]`) doesn't resolve to a struct method even when both items live in the same `impl` block — rustdoc looks up bare names in the module scope, not the enclosing impl. CI's `-D warnings` then elevates the `broken_intra_doc_links` warning to an error and fails the Documentation job.

Switch to `Self::set_synthetic_last_state_change` so rustdoc resolves through the impl.
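The half-close transitions described in the commit chain above can be condensed into a small state machine; a sketch built only from the states and events named in the text, not the actual implementation:

```rust
/// States named in the Phase 6.1 commits (FinWait2 intentionally absent).
#[derive(Debug, PartialEq, Clone, Copy)]
enum TcpNatState {
    Established,
    FinWait1,  // guest closed first; still relaying host response
    CloseWait, // host closed first; still relaying guest data
    LastAck,   // both sides closed; awaiting guest's final ACK
    Closed,
}

enum Event {
    GuestFin,
    HostEof,
    GuestFinalAck,
}

/// Transitions from the commit messages; other combinations hold state.
fn transition(state: TcpNatState, event: Event) -> TcpNatState {
    match (state, event) {
        (TcpNatState::Established, Event::GuestFin) => TcpNatState::FinWait1,
        (TcpNatState::Established, Event::HostEof) => TcpNatState::CloseWait,
        (TcpNatState::FinWait1, Event::HostEof) => TcpNatState::LastAck,
        (TcpNatState::CloseWait, Event::GuestFin) => TcpNatState::LastAck,
        (TcpNatState::LastAck, Event::GuestFinalAck) => TcpNatState::Closed,
        (s, _) => s,
    }
}
```

LastAck additionally times out to Closed after LAST_ACK_TIMEOUT (60 s), which a time-driven sweep handles rather than an event.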
This was referenced May 6, 2026
Status: DRAFT. Stacked on top of PR #68 (`smoltcp-passt-port-phase0` — Phases 0–5 + 5.5b). Implements Phase 6.4 of `docs/superpowers/plans/2026-04-30-smoltcp-passt-port.md` per the detailed TDD plan in `docs/superpowers/plans/2026-04-30-smoltcp-passt-port-phase6.4.md`.
What this branch does
Replaces `net_poll_thread`'s timer-driven `sleep(5 ms)` cycle with an event-driven `epoll_wait` dispatch using an adaptive timeout (5 ms when active, 50 ms when idle). Host TCP/UDP/ICMP socket FDs are registered with a shared `EpollDispatch` on every `flow_table` insert; `net_poll_thread` blocks on `epoll_wait` and feeds ready events to `SlirpBackend::drain_to_guest` via a small per-backend event queue. A self-pipe `Waker` lets `process_guest_frame` (vCPU thread) wake the poll thread immediately when it queues frames.
Headline perf vs `origin/main` (`voidbox-network-bench --iterations 3 --bulk-mb 10`)
Divan microbench wins (`scripts/bench-compare.sh --baseline 47868f0 --skip-vm`)
The big-N case is the design's headline win — only ready flows get visited, no full-table scan:
Adaptive timeout: how it preserves correctness
The naive `epoll_wait(5 ms fixed)` recovered CRR but burned 200 wakes/sec when idle. The naive `epoll_wait(50 ms fixed)` recovered idle CPU but caused +40 ms CRR latency (Linux delayed-ACK timer).
Adaptive: last cycle had any kernel event ⇒ next timeout = 5 ms (active); last cycle pure timeout ⇒ next timeout = 50 ms (idle). One quiet cycle drops to idle, one event back to active.
The subtle part: when `epoll_waker.wake()` fires during a 50 ms idle wait, the kernel's `epoll_wait` returns with the self-pipe event. `wait_with_timeout` filters that event out — so a naive "is_empty ⇒ idle" predicate would have stayed at 50 ms, regressing CRR. `wait_with_timeout` now returns the raw kernel count (including filtered self-pipe wakes); adaptive treats wakes as activity. Filtered events still arrive in the out parameter unchanged.
Inherent microbench overhead (test/bench fallback path only)
Microbenches without a `net_poll_thread` fall back to `drain_to_guest`'s self-poll path (one mutex op + one non-blocking `epoll_wait` syscall per call, ~310 ns). This does NOT apply to production, where `has_external_poller=true` after the first `push_ready_events` skips the syscall (~50 ns total).
Iteration history
The first cut of Phase 6.4 caused a 40× throughput regression in production (1885 → 44 Mbps). Six fix passes with eBPF profiling (perf-agent) restored it:
Profile evidence (post-fix, 60 s capture during voidbox-network-bench)
Validation
Open follow-ups (deferred to future PRs, NOT blocking)
Test plan for review